The data used in this exercise contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It has been previously released to the public domain and has been used extensively to benchmark different algorithms.
The data first appeared in: Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol. 5, 81-102, 1978.
The data was originally used to relate housing price with various predictors.
For an explanation of the individual variables, please refer to the following table:
| Variable name | Explanation |
|---|---|
| crim | per capita crime rate by town/suburb. |
| zn | proportion of residential land zoned for lots over 25,000 sq.ft. |
| indus | proportion of non-retail business acres per town. |
| chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). |
| nox | nitrogen oxides concentration (parts per 10 million). |
| rm | average number of rooms per dwelling. |
| age | proportion of owner-occupied units built prior to 1940. |
| dis | weighted mean of distances to five Boston employment centres. |
| rad | index of accessibility to radial highways. |
| tax | full-value property-tax rate per $10,000. |
| ptratio | pupil-teacher ratio by town. |
| black | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town. |
| lstat | lower status of the population (percent). |
| medv | median value of owner-occupied homes in $1000s. |
#Read the Boston data
library(MASS)
data("Boston")
#Draw an interactive table that can be used to view the data and its dimensions
library(DT)
datatable(Boston)
library(gtsummary)
#First print out a summary table of all variables
my_table <- tbl_summary(Boston,
                        digits = all_continuous() ~ 2,
                        type = all_continuous() ~ "continuous2",
                        statistic = list(all_continuous() ~ c("{mean} ({sd})",
                                                              "{median} ({p25}, {p75})",
                                                              "{min}, {max}"),
                                         all_categorical() ~ "{n} / {N} ({p}%)"))
#Bold the variable labels and print the table
my_table %>% bold_labels()
| Characteristic | N = 506 |
|---|---|
| crim | |
| Mean (SD) | 3.61 (8.60) |
| Median (IQR) | 0.26 (0.08, 3.68) |
| Range | 0.01, 88.98 |
| zn | |
| Mean (SD) | 11.36 (23.32) |
| Median (IQR) | 0.00 (0.00, 12.50) |
| Range | 0.00, 100.00 |
| indus | |
| Mean (SD) | 11.14 (6.86) |
| Median (IQR) | 9.69 (5.19, 18.10) |
| Range | 0.46, 27.74 |
| chas | 35 / 506 (6.9%) |
| nox | |
| Mean (SD) | 0.55 (0.12) |
| Median (IQR) | 0.54 (0.45, 0.62) |
| Range | 0.38, 0.87 |
| rm | |
| Mean (SD) | 6.28 (0.70) |
| Median (IQR) | 6.21 (5.89, 6.62) |
| Range | 3.56, 8.78 |
| age | |
| Mean (SD) | 68.57 (28.15) |
| Median (IQR) | 77.50 (45.02, 94.07) |
| Range | 2.90, 100.00 |
| dis | |
| Mean (SD) | 3.80 (2.11) |
| Median (IQR) | 3.21 (2.10, 5.19) |
| Range | 1.13, 12.13 |
| rad | |
| 1 | 20 / 506 (4.0%) |
| 2 | 24 / 506 (4.7%) |
| 3 | 38 / 506 (7.5%) |
| 4 | 110 / 506 (22%) |
| 5 | 115 / 506 (23%) |
| 6 | 26 / 506 (5.1%) |
| 7 | 17 / 506 (3.4%) |
| 8 | 24 / 506 (4.7%) |
| 24 | 132 / 506 (26%) |
| tax | |
| Mean (SD) | 408.24 (168.54) |
| Median (IQR) | 330.00 (279.00, 666.00) |
| Range | 187.00, 711.00 |
| ptratio | |
| Mean (SD) | 18.46 (2.16) |
| Median (IQR) | 19.05 (17.40, 20.20) |
| Range | 12.60, 22.00 |
| black | |
| Mean (SD) | 356.67 (91.29) |
| Median (IQR) | 391.44 (375.38, 396.22) |
| Range | 0.32, 396.90 |
| lstat | |
| Mean (SD) | 12.65 (7.14) |
| Median (IQR) | 11.36 (6.95, 16.96) |
| Range | 1.73, 37.97 |
| medv | |
| Mean (SD) | 22.53 (9.20) |
| Median (IQR) | 21.20 (17.02, 25.00) |
| Range | 5.00, 50.00 |
library(GGally)
#Print a scatter plot matrix
ggpairs( data = Boston,
upper = list(continuous = "points", combo = "facethist", discrete = "facetbar", na = "na"))
We can see various types of correlation in the data. Some pairs of variables show a linear correlation (e.g. rm vs medv), while others show non-linear correlations (e.g. nox vs dis).
As a continuous variable, crim appears to have a chi-squared-like distribution with a long tail towards higher values.
Crime also shows some potentially interesting correlations: for example, somewhat curiously, higher age seems to be associated with increasing crime levels. Other variables with a potential positive correlation with crime are nox and lstat, while dis and medv may be negatively correlated with crime.
There are also various variables with "conditional" correlations with crime, such as zn, indus, chas, rad, tax, and ptratio.
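These visual impressions can be cross-checked numerically. The sketch below (not part of the original exercise) simply sorts the Pearson correlations of crim against the other variables; rad and tax should land at the positive end and medv and dis at the negative end.

```r
# Cross-check the scatter-plot impressions numerically:
# Pearson correlations of crim with every other variable, sorted
library(MASS)

data("Boston")
crim_cor <- sort(cor(Boston)[, "crim"])
round(crim_cor, 2)
```

Note that a linear correlation coefficient will understate the non-linear and "conditional" relationships discussed above.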
#Scale and standardize the data as instructed
boston_scaled <- scale(Boston) %>% as.data.frame
#Yet another way to summarize variables:
library(psych)
describe(boston_scaled)
## vars n mean sd median trimmed mad min max range skew kurtosis
## crim 1 506 0 1 -0.39 -0.22 0.04 -0.42 9.92 10.34 5.19 36.60
## zn 2 506 0 1 -0.49 -0.27 0.00 -0.49 3.80 4.29 2.21 3.95
## indus 3 506 0 1 -0.21 -0.03 1.37 -1.56 2.42 3.98 0.29 -1.24
## chas 4 506 0 1 -0.27 -0.27 0.00 -0.27 3.66 3.94 3.39 9.48
## nox 5 506 0 1 -0.14 -0.08 1.12 -1.46 2.73 4.19 0.72 -0.09
## rm 6 506 0 1 -0.11 -0.05 0.73 -3.88 3.55 7.43 0.40 1.84
## age 7 506 0 1 0.32 0.09 1.03 -2.33 1.12 3.45 -0.60 -0.98
## dis 8 506 0 1 -0.28 -0.12 0.91 -1.27 3.96 5.22 1.01 0.46
## rad 9 506 0 1 -0.52 -0.09 0.34 -0.98 1.66 2.64 1.00 -0.88
## tax 10 506 0 1 -0.46 -0.05 0.64 -1.31 1.80 3.11 0.67 -1.15
## ptratio 11 506 0 1 0.27 0.10 0.79 -2.70 1.64 4.34 -0.80 -0.30
## black 12 506 0 1 0.38 0.29 0.09 -3.90 0.44 4.34 -2.87 7.10
## lstat 13 506 0 1 -0.18 -0.11 1.00 -1.53 3.55 5.07 0.90 0.46
## medv 14 506 0 1 -0.14 -0.11 0.64 -1.91 2.99 4.89 1.10 1.45
## se
## crim 0.04
## zn 0.04
## indus 0.04
## chas 0.04
## nox 0.04
## rm 0.04
## age 0.04
## dis 0.04
## rad 0.04
## tax 0.04
## ptratio 0.04
## black 0.04
## lstat 0.04
## medv 0.04
As can be seen from the summary of boston_scaled, all variable means were "moved" to 0 and the standard deviations were scaled to equal 1; in statistical terms, the data was standardized. Note that standardization does not alter the shape of a variable's distribution: if a variable was skewed before, it is still skewed after standardization.
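As a quick sanity check, this sketch (using a hand-rolled moment-based skewness estimate rather than any particular package) confirms that scaling fixes the mean and standard deviation but leaves skewness untouched:

```r
library(MASS)

data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

# means are (numerically) zero and sds exactly one after scaling
summary(abs(colMeans(boston_scaled)))
summary(apply(boston_scaled, 2, sd))

# a simple moment-based skewness estimate
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
c(before = skew(Boston$crim), after = skew(boston_scaled$crim))
```

The two skewness values are identical because standardization is a linear transformation of each variable.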
#Include a convenience library
library(dvmisc)
#Create a discrete variable "crime" from quartiles of "crim"
boston_scaled$crime <- quant_groups(x = boston_scaled$crim,
groups = 4,
cut.list=list(labels = c("low","med_low","med_high","high")))
## Observations per group: 127, 126, 126, 127. 0 missing.
#Drop the crim variable
boston_scaled <- boston_scaled[,!(names(boston_scaled) %in% "crim")]
#Demonstrate the outcome
str(boston_scaled)
## 'data.frame': 506 obs. of 14 variables:
## $ zn : num 0.285 -0.487 -0.487 -0.487 -0.487 ...
## $ indus : num -1.287 -0.593 -0.593 -1.306 -1.306 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.144 -0.74 -0.74 -0.834 -0.834 ...
## $ rm : num 0.413 0.194 1.281 1.015 1.227 ...
## $ age : num -0.12 0.367 -0.266 -0.809 -0.511 ...
## $ dis : num 0.14 0.557 0.557 1.077 1.077 ...
## $ rad : num -0.982 -0.867 -0.867 -0.752 -0.752 ...
## $ tax : num -0.666 -0.986 -0.986 -1.105 -1.105 ...
## $ ptratio: num -1.458 -0.303 -0.303 0.113 0.113 ...
## $ black : num 0.441 0.441 0.396 0.416 0.441 ...
## $ lstat : num -1.074 -0.492 -1.208 -1.36 -1.025 ...
## $ medv : num 0.16 -0.101 1.323 1.182 1.486 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 1 1 1 1 1 1 2 2 2 2 ...
This code is more or less self-explanatory (and a bit shorter than the DataCamp example).
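For comparison, the same factor can be built with base R only (no dvmisc), using quantile() and cut(); a sketch:

```r
library(MASS)

data("Boston")
bs <- as.data.frame(scale(Boston))

# quartile breakpoints of scaled crim -> a 4-level factor
bs$crime <- cut(bs$crim,
                breaks = quantile(bs$crim),
                include.lowest = TRUE,
                labels = c("low", "med_low", "med_high", "high"))
table(bs$crime)
```

This should reproduce essentially the same 127/126/126/127 split reported by quant_groups above.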
#Generate the sample indices
#Note that floor() is needed to get a whole-number sample size
#when n_rows * 0.8 is not an integer
n_rows <- nrow(boston_scaled)
my_training_sample <- sample(n_rows, size = floor(n_rows * 0.8))
#Training data set
boston_train <- boston_scaled[my_training_sample, ]
#Testing data set
boston_test <- boston_scaled[-my_training_sample, ]
This code chunk should be self-explanatory.
#Generate the lda fit for the levels of crime with all other variables
#as discriminators
my_lda <- lda(crime ~ ., data = boston_train)
#Print the lda fit
my_lda
## Call:
## lda(crime ~ ., data = boston_train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2376238 0.2574257 0.2549505 0.2500000
##
## Group means:
## zn indus chas nox rm age
## low 0.8712058 -0.8897148 -0.14929469 -0.8605452 0.41698413 -0.8282938
## med_low -0.1100038 -0.2843050 -0.04518867 -0.5512599 -0.17006439 -0.3431242
## med_high -0.3864997 0.1786143 0.18636222 0.3330771 0.06542333 0.4087441
## high -0.4872402 1.0149946 -0.03844192 1.0725546 -0.38036113 0.8055247
## dis rad tax ptratio black lstat
## low 0.8695726 -0.6935581 -0.7445669 -0.38940176 0.37819567 -0.75838409
## med_low 0.3280958 -0.5556132 -0.4846949 -0.07495078 0.31655676 -0.09966545
## med_high -0.3931898 -0.4321681 -0.3292425 -0.31669651 0.09561409 0.01301714
## high -0.8399359 1.6596029 1.5294129 0.80577843 -0.84107918 0.81969206
## medv
## low 0.50463189
## med_low -0.01308089
## med_high 0.19056331
## high -0.63054010
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.11311586 0.71056762 -0.98553803
## indus 0.03777699 -0.14658684 0.19779157
## chas -0.08314873 -0.12350563 0.09793218
## nox 0.27713250 -0.71984608 -1.35734174
## rm -0.14001516 -0.03055067 -0.17733743
## age 0.27656741 -0.23022994 -0.28308629
## dis -0.07605808 -0.09994327 0.07651933
## rad 3.81546542 0.96096876 -0.06231064
## tax -0.04078446 -0.13003493 0.66148107
## ptratio 0.15784081 0.14375014 -0.29947165
## black -0.14792118 0.00457879 0.15418612
## lstat 0.17211973 -0.23024355 0.42869668
## medv 0.18496051 -0.27846165 -0.21906840
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9630 0.0273 0.0097
The prior probabilities of groups are simply the proportions of observations in each group in the training data. These prior probabilities are printed because they are needed for calculating the so-called posterior probability, which is used for classifying the observations.
In the group means table, we can see the means of all independent variables for each quartile of crime.
One way to understand the coefficients of linear discriminants is that they are the coefficients of a linear function that calculates a "score" (y) from all independent variables in the model. The coefficients are chosen so that the group means of y are as far as possible from the overall mean of y, while the variance of y within each individual group is minimized. The values of y thus represent the best one-dimensional projection of the independent variables for telling the groups apart.
There are multiple sets of linear discriminant coefficients in the output because there are multiple levels of crime: with k groups, there are at most k-1 linear discriminant functions. This is because the k group means span at most a (k-1)-dimensional subspace, so k-1 orthogonal directions suffice to separate them optimally.
The proportion of trace for each linear discriminant function is the amount of (multivariate) between-group variance that the discriminant function is able to explain. For example, if all the group means lie on a single line in the multidimensional space of the independent variables, the first discriminant function will explain 100% of the between-group variance. On the other hand, if the group means are spread symmetrically in all directions around the overall mean, the discriminant functions will each explain a roughly equal proportion of the between-group variance, and all of them (or all dimensions) are equally important for predicting the group.
In our case the first linear discriminant function explained 96.3% of the between-group variance in our data. The two remaining discriminant functions added very little value for telling the groups apart.
Since the variables were scaled beforehand, the coefficients of the linear discriminant functions can be compared directly. For LD1, for example, rad is clearly the most influential discriminating variable.
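The "proportion of trace" line can be recomputed from the singular values stored in the lda object's svd component. The sketch below refits the model on the full scaled data rather than the 80% training split, so its numbers will differ slightly from the output above:

```r
library(MASS)

data("Boston")
bs <- as.data.frame(scale(Boston))
bs$crime <- cut(bs$crim, breaks = quantile(bs$crim), include.lowest = TRUE,
                labels = c("low", "med_low", "med_high", "high"))
bs$crim <- NULL

fit <- lda(crime ~ ., data = bs)
# proportion of trace = squared singular values, normalized to sum to 1
round(fit$svd^2 / sum(fit$svd^2), 4)
```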
#We'll use another 3rd party library for generating the biplot
#devtools::install_github("fawda123/ggord")
library(ggord)
#Print the biplot as requested
ggord(my_lda,boston_train$crime)
Only the two most significant linear discriminant functions (LD1 and LD2) are used in this plot: they are calculated for each observation, and the observations are plotted as a scatter plot of these two values.
The arrows in the plot represent the linear discriminant coefficients of the variables for LD1 and LD2: the longer the arrow, the larger the coefficient. We can again clearly see that, for the combined effect of LD1 and LD2, rad is the most significant discriminator. As noted before, this comparison is meaningful only because the independent variables were scaled.
library(dplyr)
#First save the correct crime classes in the test data separately, as instructed
#(although this step is completely unnecessary)
correct_crime_classes <- boston_test$crime
#Drop the correct crime class from the test data
boston_test <- boston_test[, !(names(boston_test) %in% "crime")]
#Predict crime class based on the test data and our lda fit
crime_predictions <- predict(my_lda, newdata = boston_test)
#Cross tabulate the results
results_table <- as.data.frame(correct_crime_classes)
results_table$predicted_class <- crime_predictions$class
str(results_table)
## 'data.frame': 102 obs. of 2 variables:
## $ correct_crime_classes: Factor w/ 4 levels "low","med_low",..: 1 2 3 3 3 3 3 1 1 2 ...
## $ predicted_class : Factor w/ 4 levels "low","med_low",..: 1 2 2 2 2 3 2 2 1 2 ...
tbl_cross( results_table,
row = correct_crime_classes,
col = predicted_class,
label = list(correct_crime_classes ~ "Correct class", predicted_class ~ "Predicted class"),
percent ="cell")
| Correct class | Predicted: low | Predicted: med_low | Predicted: med_high | Predicted: high | Total |
|---|---|---|---|---|---|
| low | 20 (20%) | 11 (11%) | 0 (0%) | 0 (0%) | 31 (30%) |
| med_low | 6 (5.9%) | 16 (16%) | 0 (0%) | 0 (0%) | 22 (22%) |
| med_high | 1 (1.0%) | 6 (5.9%) | 14 (14%) | 2 (2.0%) | 23 (23%) |
| high | 0 (0%) | 0 (0%) | 1 (1.0%) | 25 (25%) | 26 (25%) |
| Total | 27 (26%) | 33 (32%) | 15 (15%) | 27 (26%) | 102 (100%) |
#Calculate the proportion of correct classifications
mean(correct_crime_classes == crime_predictions$class)
## [1] 0.7352941
The LDA model predicted the correct crime class in 73.5% of cases in the test data. Compared to the pre-test probability of roughly 25% (in the whole data), this performance is very good.
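The "roughly 25%" baseline can be made concrete: with four quartile-based classes, always guessing the most frequent class is right only about a quarter of the time. A sketch on the full data:

```r
library(MASS)

data("Boston")
crime <- cut(Boston$crim, breaks = quantile(Boston$crim), include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))

# accuracy of the best constant guess (always the most frequent class)
max(table(crime)) / length(crime)
```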
#Reload the boston dataset and scale it
boston_scaled <- as.data.frame(scale(Boston))
#Calculate the matrix for euclidean distances for all observations
euclidean_distances <- dist(boston_scaled)
#Print a summary
summary(euclidean_distances)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
#Print info on the matrix
str(euclidean_distances)
## 'dist' num [1:127765] 1.94 2.28 2.57 2.69 2.24 ...
## - attr(*, "Size")= int 506
## - attr(*, "Labels")= chr [1:506] "1" "2" "3" "4" ...
## - attr(*, "Diag")= logi FALSE
## - attr(*, "Upper")= logi FALSE
## - attr(*, "method")= chr "euclidean"
## - attr(*, "call")= language dist(x = boston_scaled)
The values in this object are the euclidean distances (over all variables in boston_scaled) between all pairs of observations (rows) in the data table. The full distance matrix would have 506^2 entries, but only (506^2 - 506) / 2 = 127765 of them are unique pairwise distances, as can be seen from the code output.
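As a sanity check on what dist() returns, the first entry of the object should equal the hand-computed euclidean distance between rows 1 and 2:

```r
library(MASS)

data("Boston")
bs <- scale(Boston)
d <- dist(bs)

# distance between observations 1 and 2, computed two ways
manual <- sqrt(sum((bs[1, ] - bs[2, ])^2))
c(from_dist = as.matrix(d)[1, 2], manual = manual)

# number of unique pairwise distances
length(d) == (506^2 - 506) / 2
```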
#Please note that I have previously set the "global" seed for this R project
#Visualize the behaviour of TWCSS vs the number of clusters
qplot( x = 1:10,
y = sapply(1:10, function(k){kmeans(boston_scaled, k)$tot.withinss}),
geom = 'line',
ylab = "TWCSS")
As can be seen from the graph, there is a steep drop in the total within-cluster sum of squares (TWCSS) at k = 2, but not thereafter. Based on the k-means clustering, the data therefore contains two distinct multivariate clusters, with no discernible sub-clusters within them (in which case increasing k further would again reduce the TWCSS dramatically).
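The elbow can also be read off numerically. This sketch (using an arbitrary seed, since the k-means starting points are random, and nstart = 10 to stabilize the results) computes the proportional drop in TWCSS at each step; the first step (1 to 2 clusters) should dominate.

```r
library(MASS)

data("Boston")
bs <- scale(Boston)

set.seed(123)  # arbitrary; the original document sets its own global seed
twcss <- sapply(1:5, function(k) kmeans(bs, k, nstart = 10)$tot.withinss)

# proportional decrease in TWCSS when moving from k to k + 1
round(-diff(twcss) / head(twcss, -1), 2)
```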
#Perform k-means clustering with the optimal number of clusters
#and color the scatter plot matrix by the assigned cluster
ggpairs( data = boston_scaled,
upper = list(continuous = "points", combo = "facethist", discrete = "facetbar", na = "na"),
mapping = aes(col = kmeans(boston_scaled, 2)$cluster,alpha = 0.3))
It is important to recognize that k-means clustering is an unsupervised classification method: we have provided no class definitions to the algorithm. Indeed, the clusters assigned by the algorithm are not necessarily meaningful in themselves; they simply represent multivariate "concentrations" of observations.
From the scatter matrix we can see that, interestingly, higher vs lower levels of crime seem to present as two discernible multivariate clusters in our data. Since the increased crime rates have been assigned to the light blue cluster almost exclusively, we can also view the rest of the scatter matrix as contrasting what variables are associated with higher levels of crime (light blue).
From the scatter matrix, we can also visualize various interactions in the data that separate the clusters like lstat vs medv; i.e. a combination of high lstat and low medv have been assigned to the light blue cluster.
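The claim that higher crime rates fall almost exclusively into one cluster can be checked with a cross-tabulation of cluster assignment against the crime quartile (a sketch; the seed is arbitrary):

```r
library(MASS)

data("Boston")
bs <- scale(Boston)

set.seed(123)  # arbitrary seed for the random k-means starts
cl <- kmeans(bs, 2, nstart = 10)$cluster
crime_q <- cut(Boston$crim, breaks = quantile(Boston$crim), include.lowest = TRUE,
               labels = c("low", "med_low", "med_high", "high"))

table(cluster = cl, crime = crime_q)
```

The "high" column of the table should be heavily concentrated in a single cluster.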
2.1.1 Comments on the data tables
From the table we can see that some of the variables (like crim, zn, and tax) have quite skewed distributions. By exploring the interactive data table presented previously, we can see that tax has multiple discrete levels, and some of these levels have disproportionately large numbers of observations. zn is 0 in most cases (median = 0) although its range is 0 to 100.